
    Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks

    Loom (LM), a hardware inference accelerator for Convolutional Neural Networks (CNNs), is presented. In LM, every bit of data precision that can be saved translates to proportional performance gains. Specifically, for convolutional layers LM's execution time scales inversely proportionally with the precisions of both weights and activations. For fully-connected layers, LM's performance scales inversely proportionally with the precision of the weights. LM targets area- and bandwidth-constrained System-on-a-Chip designs, such as those found on mobile devices, that cannot afford the multi-megabyte buffers that would be needed to store each layer on-chip. Accordingly, given a data bandwidth budget, LM boosts energy efficiency and performance over an equivalent bit-parallel accelerator. For both weights and activations, LM can exploit profile-derived per-layer precisions. However, at runtime LM further trims activation precisions at a granularity much smaller than a layer. Moreover, it can naturally exploit weight precision variability at a smaller granularity than a layer. On average, across several image classification CNNs and for a configuration that can perform the equivalent of 128 16b x 16b multiply-accumulate operations per cycle, LM outperforms a state-of-the-art bit-parallel accelerator [1] by 4.38x without any loss in accuracy while being 3.54x more energy efficient. LM can trade off accuracy for additional improvements in execution performance and energy efficiency, and compares favorably to an accelerator that targeted only activation precisions. We also study 2- and 4-bit LM variants and find that the 2-bit-per-cycle variant is the most energy efficient.
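
    As a rough, hedged illustration of the scaling behaviour described above (a sketch, not code from the paper), the model below estimates speedup over a 16b x 16b bit-parallel baseline by assuming each layer's baseline time is proportional to its multiply-accumulate count; the layer mix and per-layer bit-widths are made-up example values.

```python
# A back-of-the-envelope model of Loom's scaling claim: convolutional layers
# run in time proportional to (weight bits x activation bits), fully-connected
# layers in time proportional to weight bits, relative to a 16b x 16b
# bit-parallel baseline. Layer mix and bit-widths below are illustrative only.

def loom_speedup(layers):
    """Overall speedup over the bit-parallel baseline, assuming each layer's
    baseline execution time is proportional to its MAC count."""
    baseline_time = sum(layer["macs"] for layer in layers)
    loom_time = 0.0
    for layer in layers:
        if layer["type"] == "conv":
            # Conv layers scale with both weight and activation precision.
            loom_time += layer["macs"] * (layer["w_bits"] * layer["a_bits"]) / (16 * 16)
        else:
            # Fully-connected layers scale with weight precision only.
            loom_time += layer["macs"] * layer["w_bits"] / 16
    return baseline_time / loom_time

example_layers = [
    {"type": "conv", "macs": 1.0e9, "w_bits": 7, "a_bits": 8},
    {"type": "conv", "macs": 2.0e9, "w_bits": 6, "a_bits": 9},
    {"type": "fc",   "macs": 0.1e9, "w_bits": 10},
]
print(f"modelled speedup: {loom_speedup(example_layers):.2f}x")
```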

    Multi-Grain Coherence Directory

    Conventional directory coherence operates at the finest granularity possible, that of a cache block. While simple, this organization fails to exploit frequent application behavior: at any given point in time, large, contiguous chunks of memory are often accessed only by a single core. We take advantage of this behavior and investigate reducing the coherence directory size by tracking coherence at multiple granularities. We show that such a Multi-grain Directory (MGD) can significantly reduce the required number of directory entries across a variety of workloads. Our analysis shows that a simple dual-grain directory (DGD) obtains the majority of the benefit while tracking individual cache blocks and coarse-grain regions of 1KB to 8KB. We propose a practical DGD design that is transparent to software, requires no changes to the coherence protocol, and has no unnecessary bandwidth overhead. This design can reduce the coherence directory size by 41% to 66% with no statistically significant performance loss. © 2013 ACM
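
    A minimal sketch of the dual-grain idea, assuming Python dictionaries stand in for the two directory structures; the region size, the private-region heuristic, and all identifiers are illustrative assumptions, not the paper's actual DGD organization.

```python
# A toy dual-grain directory lookup: blocks with their own fine-grain entry use
# it; otherwise a coarse-grain entry covering the enclosing region supplies the
# single owning core, exploiting regions that are private to one core.
# Dictionary-based structures, region size, and names are illustrative.

BLOCK_SIZE = 64        # bytes per cache block (fine grain)
REGION_SIZE = 4096     # bytes per region (the paper evaluates 1KB to 8KB)

block_dir = {}         # block number  -> set of sharer core ids
region_dir = {}        # region number -> owning core id for a private region

def lookup(addr):
    """Return the set of cores that may cache the block containing addr."""
    block, region = addr // BLOCK_SIZE, addr // REGION_SIZE
    if block in block_dir:          # fine grain: shared or contended blocks
        return block_dir[block]
    if region in region_dir:        # coarse grain: one entry covers the region
        return {region_dir[region]}
    return set()                    # untracked: no sharers

def record_access(addr, core):
    """Track an access, demoting a block to fine-grain tracking when a second
    core touches its region (a real design would also update the region entry)."""
    block, region = addr // BLOCK_SIZE, addr // REGION_SIZE
    if block in block_dir:
        block_dir[block].add(core)
    elif region not in region_dir:
        region_dir[region] = core   # first toucher owns the whole region
    elif region_dir[region] != core:
        block_dir[block] = {region_dir[region], core}
```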

    ReCast: Boosting tag line buffer coverage in low-power high-level caches "for free"

    We revisit the idea of using small line buffers in front of caches. We propose ReCast, a tiny tag set cache that filters a significant number of tag probes to the L2 tag array, thus reducing power. The key contribution in ReCast is S-Shift, a simple indexing function (no logic involved, just wires) that greatly improves the utility of line buffers with no additional hardware cost. S-Shift can be viewed as a technique for emulating larger cache blocks, and hence exploiting more spatial locality, without paying the penalties of actually using a larger L2 cache block. Using several SPEC CPU2000 applications and a model of an aggressive, dynamically-scheduled, superscalar processor, we demonstrate that a practical ReCast organization can significantly reduce power in the L2. Specifically, a 64-entry ReCast comprising eight sub-banks of eight entries each can filter about 50% of all tag probes for a 1Mbyte L2 cache. A conventional line buffer of the same size filters only 32% of all tag probes. The resulting average reduction in L2 tag power is 38% and 85% with writeback and writethrough L1 caches, respectively. This translates to a reduction of 16% and 52% of the overall L2 power, respectively. We also analyze a few representative applications, explaining why S-Shift works well. © 2005 IEEE
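
    The sketch below models only the generic filtering structure: a small, sub-banked buffer of recently probed L2 tag sets placed in front of the tag array, where buffer hits avoid powering up the full tag array. The buffer geometry follows the 64-entry, eight-sub-bank configuration quoted above, but the set-index function is a plain modulo; the paper's S-Shift wiring is not reproduced here, and all identifiers are illustrative.

```python
# A toy model of the filtering structure only: probes that hit in the small
# line buffer never reach (and never power up) the L2 tag array. The geometry
# matches the 64-entry / eight sub-bank configuration above, but the set index
# is a plain modulo; S-Shift itself is not reproduced here.

from collections import OrderedDict

L2_BLOCK_BYTES = 64    # L2 block size
BUF_SUBBANKS = 8       # sub-banks in the line buffer
BUF_ENTRIES = 8        # entries per sub-bank (64 entries total)

subbanks = [OrderedDict() for _ in range(BUF_SUBBANKS)]  # LRU order per sub-bank
probes = filtered = 0

def probe(addr):
    """Return True if the L2 tag probe is filtered by the line buffer."""
    global probes, filtered
    probes += 1
    tag_set = addr // L2_BLOCK_BYTES          # L2 tag set being probed (simplified)
    bank = subbanks[tag_set % BUF_SUBBANKS]
    if tag_set in bank:
        bank.move_to_end(tag_set)             # refresh LRU position
        filtered += 1
        return True
    bank[tag_set] = True                      # allocate a new entry
    if len(bank) > BUF_ENTRIES:
        bank.popitem(last=False)              # evict the LRU entry
    return False

for a in (0, 64, 0, 128, 64, 0):              # toy L1-miss address stream
    probe(a)
print(f"filtered {filtered} of {probes} tag probes")
```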